05. Text Processing, Bag of Words

Text Processing

I mentioned that one of our tasks will be to convert any user input text into data that our deployed model can accept as input. You've seen a few examples of text pre-processing, and the steps usually go something like this:

  1. Get rid of any special characters like punctuation
  2. Convert all text to lowercase and split into individual words
  3. Create a vocabulary that assigns each unique word a numerical value or converts words into a vector of numbers

This last step is often called word tokenization or vectorization.
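To make these three steps concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the exact processing code used in this course: the function names preprocess and build_vocab are hypothetical, and treating every character other than a letter, digit, or whitespace as a "special character" is one possible choice among several.

```python
import re

def preprocess(text):
    """Hypothetical helper: clean a string and split it into lowercase words."""
    # Step 1: replace anything that is not a letter, digit, or whitespace with a space
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Step 2: lowercase, then split on whitespace into individual words
    return cleaned.lower().split()

def build_vocab(words):
    """Hypothetical helper: map each unique word to an integer, in order of first appearance."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

words = preprocess("Hello, world! Hello again.")
vocab = build_vocab(words)           # Step 3: word -> integer id
ids = [vocab[w] for w in words]      # the text as a sequence of numbers

print(words)  # ['hello', 'world', 'hello', 'again']
print(vocab)  # {'hello': 0, 'world': 1, 'again': 2}
print(ids)    # [0, 1, 0, 2]
```

Assigning ids in order of first appearance is just the simplest option; a real pipeline might instead order the vocabulary by word frequency or cap its size.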

And in the next example, you'll see exactly how I do these processing steps; I'll also be vectorizing words using a method called bag of words. If you'd like to learn more about bag of words, please check out the video below, recorded by another of our instructors, Arpan!

Bag of Words


You can read more about the bag of words model, and its applications, on this page. It's a useful way to represent a text based on how often each word occurs in it.
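As a rough sketch of the idea, the snippet below turns a document into a vector of word counts over a fixed vocabulary. The function name bag_of_words, the tiny vocabulary, and the example sentence are all made up for illustration; this is one simple way to compute the counts, not the specific implementation shown in the video.

```python
import re
from collections import Counter

def bag_of_words(document, vocabulary):
    """Hypothetical helper: count how often each vocabulary word occurs in the document."""
    # Lowercase, replace punctuation with spaces, and split into words
    words = re.sub(r"[^a-z0-9\s]", " ", document.lower()).split()
    counts = Counter(words)
    # One count per vocabulary word, in vocabulary order; words outside the vocabulary are ignored
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "sat", "dog"]
vector = bag_of_words("The cat sat. The cat slept!", vocabulary)
print(vector)  # [2, 2, 1, 0]
```

Note that the vector records only how many times each word occurs, not the order of the words, which is where the "bag" in the name comes from.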

Quiz

For the following quiz questions, consider a "document" that is just the following sentence:

At a very basic level, I think people need the opportunity to learn and to grow.

Given only this document, what is the size of the vocabulary (the unique words in the document)?

SOLUTION: 15
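If you'd like to check this count yourself, here is a small sketch. It assumes the same cleaning steps as above (lowercase everything and drop punctuation), so that "At" and "at" would count as the same word.

```python
import re

sentence = "At a very basic level, I think people need the opportunity to learn and to grow."

# Lowercase, replace punctuation with spaces, and split into words
tokens = re.sub(r"[^a-z\s]", " ", sentence.lower()).split()
unique_words = set(tokens)

print(len(tokens))        # 16 words in total
print(len(unique_words))  # 15 unique words, because "to" appears twice
```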